Application of machine learning algorithms for localized syringe services program policy implementation – Florida, 2017

Abstract
Background
People who inject drugs (PWID) are highly vulnerable to a multitude of harms related to their substance use, including viral (e.g. HIV, Hepatitis C) and bacterial infections (e.g. endocarditis). Implementation of evidence-based interventions, such as syringe services programs (SSPs), remains imperative, particularly in locations at an increased risk of HIV outbreaks. This study aims to identify communities in Florida that are high-priority locations for SSP implementation by examining state-level data related to the substance use and overdose crises.
Methods
State-level surveillance data were aggregated at the ZIP Code Tabulation Area (ZCTA) level (n = 983) for 2017. We used confirmed cases of acute HCV infection as a proxy for injection drug use. Least Absolute Shrinkage and Selection Operator (LASSO) regression was used to develop a machine learning model to identify significant indicators of acute HCV infection and high-priority areas for SSP implementation due to their increased vulnerability to an HIV outbreak.
Results
The final model retained three variables of importance: (1) the number of drug-associated skin and soft tissue infection hospitalizations, (2) the number of chronic HCV infections in people aged 18–39, and (3) the number of drug-associated endocarditis hospitalizations. High-priority SSP implementation locations were identified in both urban and rural communities outside of current Ending the HIV Epidemic counties.
Conclusion
SSPs are long-researched, safe, and effective evidence-based programs that offer a variety of services that reduce disease transmission and assist with combating the overdose crisis. Opportunities to increase services in needed regions across the state now exist in Florida, as supported by the expansion of the Infectious Disease Elimination Act of 2019. This study details where potential areas of concern may be and highlights regions where future evidence-based harm reduction programs, such as SSPs, would be useful to reduce opioid overdoses and disease transmission among PWID.
Key messages
The rate of acute HCV in Florida in 2017 was 1.9 per 100,000, nearly twice the national average. Serious injection related infections among PWID are significant indicators of acute HCV infection. High-priority SSP implementation locations in Florida were identified in both urban and rural communities, including those outside of current Ending the HIV Epidemic counties.


Introduction
Due to the convergence of the opioid and stimulant crises in the United States [1][2][3], there has been a significant increase in the prevalence of people who inject drugs (PWID), as well as in the incidence of overdose death [4,5]. PWID are highly vulnerable to a multitude of harms related to their substance use, including viral (e.g. HIV, Hepatitis C) and bacterial infections (e.g. skin and soft tissue, infective endocarditis) [6][7][8] and fatal overdose [9]. In 2018, approximately 10% of new HIV infections were related to injection drug use (IDU) [10], and IDU has been the primary risk factor for the rising rate of acute hepatitis C virus (HCV) infections across the U.S. [11]. In addition, hospitalizations related to IDU-associated bacterial infections, such as infective endocarditis, have increased significantly over the last 10 years [12,13].
While the number of HIV diagnoses among PWID steadily decreased between 2010 and 2015 [14], IDU-associated HIV outbreaks linked to opioid and other concurrent substance use disorders [15][16][17][18][19][20] have contributed to a significant increase in HIV diagnoses among this vulnerable population. This concerning trend has generated local and national focus on rapid recognition of HIV outbreaks and implementation of control measures to mitigate further transmission. In 2016, the Centres for Disease Control and Prevention published a nationwide assessment of U.S. counties most vulnerable to rapid spread of IDU-associated HIV [21], utilizing county-level acute HCV infection as a proxy measure for IDU. Results from this analysis highlighted two important findings: (1) social and economic conditions are significantly related to acute HCV, and (2) the most vulnerable counties lacked sufficient harm reduction-based HIV prevention strategies for PWID, specifically syringe services programs (SSPs). Recent research has corroborated these findings and has expanded to investigate the utility of risk environment frameworks to better understand the physical, social, and economic influences at the community level, providing robust context to drivers of drug-related harms [22].
The methodology presented in Van Handel et al. (2016) has been adopted by state health departments to understand the localized context of counties vulnerable to rapid HIV spread among PWID and to geographically target the implementation of HIV prevention interventions [23], and it has been extended to zip code [24] and census tract [25] geographical levels. These national and state-level analyses have led to significant increases in the authorization and implementation of SSPs across the country [26]. However, SSPs have experienced, and continue to experience, significant political opposition, which has led to closures in highly vulnerable locations (West Virginia and Indiana) [27]. In 2019, the Florida Legislature passed the Infectious Disease Elimination Act (IDEA), authorizing the expansion of SSPs across the state by allowing counties to pass ordinances to implement these harm reduction programs in their respective jurisdictions [28]. Florida has been severely impacted by the syndemic of overdose, HIV infection, and HCV infection. In 2020, 6,089 Floridians died from an opioid-related overdose [29], and 7 of the 67 counties (Miami-Dade, Broward, Palm Beach, Hillsborough, Pinellas, Orange, and Duval) have been identified as high-priority counties under the Ending the HIV Epidemic: Plan for America initiative [30]. Taken together with the expansion of SSPs, these circumstances make it imperative to understand the highest-priority counties and zip-code level locations in Florida for local HIV prevention policy and program implementation for PWID.
The current methodology used for identifying vulnerable locations comprises a multi-step process, including assessment for multicollinearity, variable reduction (e.g. principal component analysis), and regression modelling that is subject to overfitting [21,23,31]. The use of machine learning (ML) algorithms has become more common in the field of HIV prevention research, offering a flexible method to evaluate large and complex data. ML is broadly defined as the process by which computational and statistical algorithms "learn" from data [32]. There are a myriad of ML algorithms used in practice, ranging in complexity, applicability, and functionality (e.g. regression, regularization, decision tree, Bayesian, deep learning, and clustering). These learning algorithms have been applied in HIV research, including the creation of prediction tools for providers to identify candidates for PrEP [33,34], determining factors associated with HIV testing among high-risk groups [35], and identifying individuals at high risk for HIV acquisition [36], and they have recently been used to predict vulnerable locations for overdose, HIV, and HCV [25]. This study aims to identify jurisdictions in Florida that are high-priority locations for rapid SSP implementation by applying an ML algorithm to state-level data that are related to the substance use and infectious disease epidemics.
[37]. Acute HCV incidence was defined as newly diagnosed by positive HCV antibody and/or positive RNA nucleic acid amplification test with discrete onset of symptoms consistent with acute viral hepatitis (e.g. fever, headache, malaise, anorexia, nausea, vomiting, diarrhea, and abdominal discomfort) and either jaundice or elevated liver enzymes (serum alanine aminotransferase [ALT] level >200 IU/L) during the period of acute illness. All data were aggregated as counts by ZIP code for 2017.
State-level health variables included in the models were: number and rate of deaths related to all drugs, number and rate of deaths related to heroin and opioids only, number and rate of deaths related to multiple substances, rate of nonfatal drug overdoses (all drugs), rate of nonfatal drug overdoses (opioid), number and rate of sexually transmitted infections (STIs; i.e. syphilis, gonorrhoea, and chlamydia), number and rate of chronic HCV infections in people between the ages of 18-39 (defined as laboratory-confirmed positive HCV RNA that does not meet the case definition of acute hepatitis C), and the number and rate of serious injection related infection (SIRI) hospitalizations. SIRI included infective endocarditis, skin and soft tissue infections (SSTIs), osteomyelitis, and bacteraemia and sepsis. SIRI were determined based on a validated ICD-10 algorithm, and a more in-depth description of the methodology for identifying these infections is published elsewhere [38].

Study design and setting
Since state-level data were collected at the ZIP code level and ACS variables were collected at the 5-digit ZCTA level, ZIP codes were transformed into corresponding ZCTAs using the Uniform Data System Mapper "ZIP Code to ZCTA crosswalk" calculator [39]. ZCTAs are generalized areal representations of ZIP Code service areas created by the U.S. Census Bureau to develop a geographical boundary. ACS-specific estimates included in the models (Table 2) were: estimated total population, percentage of population aged 18-24, percentage of persons without health insurance, percentage of households with a vehicle available, percentage of people with no high school diploma (≥25 years old), per capita income, percent of people living in poverty (based on Census-defined poverty levels), income inequality Gini coefficient, percentage of the total population that is non-Hispanic White, non-Hispanic Black, and Hispanic, total housing units, number of vacant housing units, percentage of vacant housing units, number of mobile homes, percentage of mobile homes, percentage of homes with no phone service, and percentage of the population never married. Variables with non-normal distributions, such as per capita income, were log-transformed.
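The ZIP-to-ZCTA step described above can be illustrated with a minimal Python sketch. It is not the authors' workflow: the crosswalk entries, case counts, and populations below are hypothetical placeholders, and the helper names `aggregate_to_zcta` and `rate_per_100k` are invented for illustration. It only shows the idea of summing ZIP-level counts into ZCTAs and converting them to rates per 100,000.

```python
from collections import defaultdict

# Hypothetical ZIP -> ZCTA crosswalk (real mappings come from the crosswalk file).
zip_to_zcta = {"zipA": "zcta1", "zipB": "zcta1", "zipC": "zcta2"}

# Hypothetical case counts by ZIP code and total population by ZCTA.
cases_by_zip = {"zipA": 2, "zipB": 3, "zipC": 1}
population_by_zcta = {"zcta1": 10000, "zcta2": 20000}


def aggregate_to_zcta(counts_by_zip, crosswalk):
    """Sum ZIP-level counts into ZCTAs; ZIPs absent from the crosswalk are skipped."""
    totals = defaultdict(int)
    for zip_code, count in counts_by_zip.items():
        zcta = crosswalk.get(zip_code)
        if zcta is not None:
            totals[zcta] += count
    return dict(totals)


def rate_per_100k(counts, population):
    """Convert counts to rates per 100,000; ZCTAs with no recorded cases get 0."""
    return {
        zcta: 100000 * counts.get(zcta, 0) / pop
        for zcta, pop in population.items()
        if pop > 0
    }


cases_by_zcta = aggregate_to_zcta(cases_by_zip, zip_to_zcta)
rates = rate_per_100k(cases_by_zcta, population_by_zcta)
```

Several ZIP codes (for example, post office box ZIPs) can map to one ZCTA, which is why the counts are summed rather than copied across.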
Statistical analysis

Spatial autocorrelation.
Due to the geographical nature of this analysis, we examined the spatial distribution of our outcome variable (rate of acute HCV infection) across ZCTAs to understand its spatial autocorrelation. We used the global Moran's I statistic to evaluate whether there was a significant clustering pattern in our outcome variable [40,41]. Once Moran's I was computed, we used Monte Carlo simulation to determine the distribution of Moran's I expected for our outcome variable if the data were spatially random [42]. We found significant spatial autocorrelation (Moran's I = 0.0965, p = .001) across ZCTAs, suggesting that neighbouring ZCTAs have similar rates of acute HCV infection, with high-high and low-low clustering. To account for this autocorrelation in our outcome variable, we included a spatial autocorrelation measure in our model by averaging the rate of acute HCV infection among each ZCTA's five closest neighbours [43].
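The following Python sketch illustrates global Moran's I, a Monte Carlo permutation test, and a five-nearest-neighbour average of the outcome. It makes several assumptions that the text does not state: binary k-nearest-neighbour weights are used for Moran's I itself, distances are Euclidean between hypothetical grid coordinates, and the rates are simulated. The function names are illustrative and do not come from any package.

```python
import math
import random


def knn_weights(coords, k):
    """Binary weights: w[i][j] = 1 if j is one of the k points closest to i."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in others[:k]:
            weights[i][j] = 1.0
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def permutation_p_value(values, weights, n_sim=199, seed=0):
    """One-sided pseudo p-value from random relabelling of locations."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)


def neighbour_mean(values, weights):
    """Average of the outcome over each location's neighbours (spatial lag)."""
    return [sum(w * v for w, v in zip(row, values)) / sum(row) for row in weights]


# Hypothetical 6 x 6 grid of areas with high rates clustered on one side.
coords = [(x, y) for x in range(6) for y in range(6)]
rates = [10.0 if x >= 3 else 0.0 for x, _ in coords]
weights = knn_weights(coords, k=5)
observed_i, p_value = permutation_p_value(rates, weights, n_sim=199)
spatial_lag = neighbour_mean(rates, weights)
```

In this toy layout the clustering is deliberately strong, so the statistic is far larger than the 0.0965 reported for the Florida data. The `spatial_lag` list corresponds to the kind of neighbour-average covariate described above.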

Model development.
Using data collected from the ACS 2013-2017 5-year estimates and state-level surveillance in the state of Florida, we fitted models to predict acute HCV infection at the ZCTA geographical level. Based on the distribution of the outcome variable, we used a standard Poisson regression model with the Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO regression procedure performs L1 regularization, optimizing predictive accuracy by automating the selection of variables through shrinkage and eliminating non-significant variables by setting their coefficients to zero [44]. LASSO works by applying a shrinkage penalty lambda (λ), a tuning hyperparameter, to the absolute values of the regression coefficients when minimizing the model's loss function (the sum of squares in the linear case). Increasing the lambda value increases bias in the model and allows more coefficients to be set to zero and eliminated from the model (i.e. variable selection). To reduce overfitting, improve model performance, and determine the optimal regularization parameter, we divided the overall dataset into a training dataset and a validation dataset. Data were randomly split, with 70% of the data used for model training and 30% used for validating model performance. Using the training dataset only, we used k-fold cross-validation to determine the optimal, user-defined lambda value [45,46]. A vector of candidate lambda values ranging from 10^-5 to 10^5 was created, and the optimal lambda value was defined as the one that minimized the root mean squared error (RMSE); this value was used in the final model (Figure 1). Parameter estimates were determined for the model using the optimal lambda value with a Poisson distribution. Because of the number of zeros in the outcome variable (72%), we also tried to fit models with negative binomial and zero-inflated Poisson distributions. However, these models failed to converge.
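The tuning procedure described above can be sketched in Python as follows. This is an illustrative stand-in, not the authors' glmnet and caret code: it fits an L1-penalized Poisson regression by proximal gradient descent (glmnet uses coordinate descent), the data are simulated, and the step size, iteration count, and function names are assumptions chosen for the example. Predictors are drawn already standardized, so no scaling step is shown.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_poisson_lasso(X, y, lam, step=0.05, n_iter=1000):
    """Minimise mean Poisson negative log-likelihood + lam * ||beta||_1.

    The intercept is not penalised. Fixed-step proximal gradient is an
    assumption made for brevity, not the solver used in the paper.
    """
    n, p = X.shape
    intercept = np.log(max(y.mean(), 1e-8))
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(np.clip(intercept + X @ beta, -20, 20))
        resid = mu - y
        intercept -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return intercept, beta


def predict(X, intercept, beta):
    return np.exp(np.clip(intercept + X @ beta, -20, 20))


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def cv_rmse(X, y, lam, k, rng):
    """Mean held-out RMSE over k folds for one candidate lambda."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        b0, b = fit_poisson_lasso(X[train], y[train], lam)
        errors.append(rmse(y[held_out], predict(X[held_out], b0, b)))
    return float(np.mean(errors))


# Simulated counts: only the first two of six predictors carry signal.
rng = np.random.default_rng(0)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = rng.poisson(np.exp(-1.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1])).astype(float)

idx = rng.permutation(n)
train, valid = idx[:210], idx[210:]  # 70% training, 30% validation
lambdas = 10.0 ** np.arange(-5, 6)   # grid from 10^-5 to 10^5
# Re-seeding per lambda keeps the fold assignment identical across candidates.
scores = [
    cv_rmse(X[train], y[train], lam, 10, np.random.default_rng(1))
    for lam in lambdas
]
best_lam = float(lambdas[int(np.argmin(scores))])

b0, beta = fit_poisson_lasso(X[train], y[train], best_lam)
train_rmse = rmse(y[train], predict(X[train], b0, beta))
valid_rmse = rmse(y[valid], predict(X[valid], b0, beta))
retained = np.flatnonzero(beta)  # indices of predictors kept by the penalty
```

Comparing `train_rmse` with `valid_rmse` mirrors the overfitting check described in the text, and `retained` plays the role of the variables selected at the optimal lambda. At the largest grid values every coefficient is shrunk to zero, which is the variable-elimination behaviour the paragraph describes.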
In addition, we explored using Random Forest (RF) as an additional specification check to assess issues with the data imbalance (preponderance of zero values) and potential high-order interactions. Because the RF model corroborated our findings from the LASSO model, we proceeded with the LASSO model. To assess how well the model performed on unseen data, the model trained on the training dataset was used to determine predictive accuracy on the validation dataset (i.e. the remaining 30% of the data). The RMSE values of the model on the training and validation datasets were computed and compared.

Variable of importance.
We further evaluated the variable importance rankings to identify which variables had the strongest predictive value for acute HCV infection. The variables selected by the model with the optimal lambda value were reported to show which variables had the most predictive power. All analyses were completed using the caret and glmnet packages in the R 4.0.1 statistical environment.

Vulnerability mapping.
Shapefiles for 2017 ZCTAs for the state of Florida were downloaded using the tigris package. Shapefiles were merged with the predicted values of acute HCV infections from the training and final predictive model and mapped using the ggplot2 package. Predicted values were split into deciles to identify the highest-priority areas for SSP implementation, defined as ZCTAs at or above the 90th percentile of predicted acute HCV infection. All mapping procedures were performed in the R 4.0.1 statistical environment. The optimal model from the training data was used to predict the outcome for both training and validation data to provide vulnerability mapping for all ZCTAs in Florida.
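As a rough illustration of the top-decile rule, the Python sketch below flags areas whose predicted value is at or above the 90th percentile. The percentile convention (linear interpolation between order statistics) is an assumption, since the text does not say which one was used, and the predicted values and ZCTA labels are hypothetical.

```python
def percentile_cutoff(values, pct):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def flag_high_priority(predicted, pct=90):
    """Return the set of area IDs with predictions at or above the cutoff."""
    cutoff = percentile_cutoff(list(predicted.values()), pct)
    return {area for area, value in predicted.items() if value >= cutoff}


# Hypothetical predictions for 20 areas, then for 983 areas with distinct values.
small = {f"zcta{i:02d}": float(i) for i in range(1, 21)}
high_small = flag_high_priority(small)

large = {f"zcta{i:03d}": float(i) for i in range(983)}
high_large = flag_high_priority(large)
```

Under this convention, and with no ties at the cutoff, 983 areas yield 99 flagged areas, which is consistent with the number of high-priority ZCTAs reported in the Results.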

Results
In 2017, 404 acute HCV infections were reported to the Florida Department of Health across the 983 ZCTAs in Florida, for an overall incidence of 1.9 per 100,000, nearly twice the national average [47]. Acute HCV incidence across ZCTAs ranged from 0 to 46.9 per 100,000. A detailed overview of each feature's description, data source, mean, median, and interquartile range (IQR) is presented in Tables 1 and 2, and a correlation matrix of all features is presented in Figure 2.

Results of the training LASSO and model validation
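Using 10-fold cross-validation, the optimal lambda value in the LASSO training model that produced the lowest RMSE was λ = 0.561 (Figure 1). The final model retained three variables: the number of drug-associated skin and soft tissue infection hospitalizations, the number of chronic HCV infections in people aged 18-39, and the number of drug-associated endocarditis hospitalizations. The coefficients of all remaining variables were set to zero and eliminated from the model. When applied to the validation dataset, the RMSE of the model was 4.44, suggesting the model had good predictive performance and minimal overfitting.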

Vulnerability mapping
Based on the predicted values obtained from the training and validation model, high-priority areas were located in both urban and rural settings, even outside of the current Ending the HIV Epidemic jurisdictions (Figure 4). There were 27 counties that contained the 99 ZCTAs that were identified as high priority.

Discussion
This ecological study provides important information regarding high-priority locations in Florida for the implementation of HIV prevention programs (i.e. SSPs) to serve PWID, a population vulnerable to the rapid transmission of HIV infection [15,17,18,48]. Our analysis provides state, county, and community-level stakeholders (e.g. health departments) with granular information regarding where resource allocation and planning for localized SSP implementation should be focused. This study also highlights the utility of state-level surveillance data integration across departments and data sources. Through the application of a machine learning algorithm, we identified significant indicators for acute HCV infection, such as chronic HCV infection among people aged 18-39, drug-associated skin and soft tissue infection hospitalizations, and drug-associated infective endocarditis hospitalizations. Our data suggest a significant relationship between chronic HCV among people aged 18-39 and acute HCV incidence. Previous research has suggested that there is a plausible mechanistic relationship between chronic HCV and HCV incidence through geographical variability in community viral load [49]. Areas with a high burden of active and untreated HCV may serve as an HCV reservoir, increasing the probability of HCV being transmitted during sharing of injection equipment among PWID in the absence of prevention [50]. With the increasing prevalence of younger PWID [51] and increasing rates of chronic HCV among persons under the age of 39 [11], coupled with limited access to curative HCV treatment due to sobriety restrictions and a historical lack of HCV prevention (i.e. SSPs) among PWID in Florida, a multifaceted approach through treatment access and scaling up prevention remains imperative in the control of HCV.
These results also expand on state-level variables collected in existing surveillance systems by examining IDU-associated bacterial infections among a cohort of PWID identified by ICD-10 codes. The results from the final model highlight the compounding harms that PWID face outside of viral infections (e.g. HCV and HIV), suggesting that a state-wide surveillance system of bacterial infections (e.g. infective endocarditis) should be developed to better track and understand the trends of infectious sequelae due to the substance use and overdose crises.
The machine learning algorithm predicted well but showed room for improvement in prediction performance, with an RMSE value >4 and an R-squared value <0.10. RMSE is an absolute measure of fit, providing information on how close the observed data points are to the model's predicted values [52]. The limited performance may be due, in part, to the relative imbalance in acute HCV infections. Many (72%) of the ZCTAs did not report any acute HCV infection, and modelling of relatively rare events can be difficult. Because LASSO regression simultaneously performs variable selection and retention, we produced a parsimonious model of 3 features, which makes the final model simpler to understand. This analysis contextualizes, geographically, high-priority ZCTAs for implementation of prevention services for PWID (Figure 4). With the expansion legislation passed in 2019 to allow all counties in Florida to implement SSPs, counties that contain ZCTAs in the 90th percentile should urgently look to support and pass local legislation to implement these evidence-based programs. The effectiveness and cost-effectiveness of SSPs as a public health strategy are well established [53][54][55][56]; SSPs have garnered support from the Centres for Disease Control and Prevention and are explicitly named as a cornerstone program in the "Prevent" pillar of the Ending the HIV Epidemic initiative. To date, 9 counties (Miami-Dade, Broward, Palm Beach, Hillsborough, Pinellas, Manatee, Leon, Alachua and Orange) have passed local ordinances authorizing an SSP within their respective jurisdictions, 7 of which were identified as counties containing high-priority ZCTAs. While the majority of high-priority counties under the Ending the HIV Epidemic initiative have passed ordinances, this analysis highlights additional locations where local SSP implementation is imperative, including both urban (85%) and rural (15%) counties (defined by the 2010 Census).
The counties identified in this analysis closely match the drug-related overdose deaths by county in 2017 [57], highlighting the syndemic opioid and overdose crises faced by Florida counties.
Based on the significant predictors of acute HCV infection, state policymakers and community stakeholders should assess the implementation of harm reduction and behavioural interventions in medical-based settings, such as emergency departments, where PWID are frequent utilizers [58]. There has been increased focus on the integration of addiction medicine and infectious disease specialties to develop "Serious Injection-Related Infection (SIRI)" teams due to the significant increase in infections like infective endocarditis [59,60]. These teams are focused on providing both gold standard antibiotic therapies and evidence-based substance use disorder treatment to patients hospitalized with SIRIs [61,62] to optimize health-related outcomes. These teams are well positioned to deliver harm reduction interventions to PWID, including linkage to HIV prevention (e.g. PrEP), HIV and HCV treatment, and outpatient medications for opioid use disorder [63].
Beyond additional interventions, these findings, and the model, have important implications for the prediction and prevention of IDU-associated HIV outbreaks. Research has demonstrated that outbreaks of IDU-associated HCV may precede the rapid transmission of HIV, most saliently in the Scott County, Indiana outbreak [64,65]. In 2018, Miami-Dade County detected an outbreak of HIV among its PWID population after the implementation of an SSP in December 2016 [15]. Based on the results of this model using data from 2017, Miami-Dade County contained 2 ZCTAs that were identified as high-priority areas, one of which was the exact ZCTA where the outbreak was identified, investigated, and mitigated by the local SSP and the Florida Department of Health. This convergence of predicted and detected outbreaks may highlight the practical utility of this model to identify outbreaks in Florida. In addition, bacterial infections, such as SSTIs and infective endocarditis, could be further upstream indicators of HCV and HIV infection, highlighting the importance of incorporating these infections in the prediction of HIV outbreaks in future research [66].

Limitations
This analysis is subject to several limitations. First, there is a lack of accurate and robust surveillance reporting for acute HCV infection and other state-level data, such as drug-related deaths and EMS calls for a drug-related overdose. This also includes changes in case definitions over time, underreporting, and misclassification that can cause issues with the reliability of the data being modelled. However, we utilized only 2017 data on acute HCV infection, in which a consistent case definition was applied across the year, and these data are the best measures available at the state level. In addition, PWID often avoid health care services due to pervasive stigma [67], remain hesitant to call 911 when responding to an overdose [68], and use naloxone distributed by SSPs in the field [69], suggesting that existing data sources are limited in capturing representative metrics. However, at the time of this study, Miami-Dade was the only county with street-level distribution of naloxone, so these unreported nonfatal overdoses would not impact the model outside of Miami-Dade County in 2017. Second, our data were limited to a cross-sectional framework, which did not allow for forecasting or for including spatiotemporal dynamics to map risk in space and time. In addition, the final model from our 10-fold cross-validation was used to make predictions on both the training and validation datasets in order to obtain predicted values for all ZCTAs for vulnerability mapping. The values for the training data are therefore fitted values, whereas the values for the validation data are truly predicted values, and the two subsets of ZCTAs may differ in accuracy. Third, the most significant variables in our models were variables that are not routinely collected by the state.
This exclusion poses potential issues with the ability to rapidly apply this methodology to new data when available, although it does point to potentially important data to add to the state's surveillance efforts. The Agency for Health Care Administration (AHCA) in Florida is responsible for the collection and management of claims data, which could be utilized to provide these data on a timely basis. Fourth, this study utilized "black-box" prediction algorithms that increase the complexity of understanding how and which variables are driving prediction. However, the Variable Importance Index (VIMP) can provide insights into how variables influence prediction by ranking which variables are most important in the model. Fifth, the machine learning algorithm used can be sensitive to class imbalance, which may have resulted in suboptimal predictive performance of the model. Zero-inflated, negative binomial, and Random Forest models were explored; however, the zero-inflated and negative binomial models did not converge, and the Random Forest model corroborated our results from the LASSO model. Lastly, high correlation between features in the models may have impeded model performance and variable importance (Figure 2). Nonetheless, taken together, this analysis provides a more robust methodology and granular understanding of high-priority areas for SSP implementation.

Conclusions
SSPs offer a multitude of benefits for PWID. This study provides an application of machine learning algorithms that can offer a streamlined methodology for states undertaking their own vulnerability assessments. Future research should explore longitudinal modelling approaches in order to improve prediction and forecasting of risk in space and time. This study also refines the geographical unit of analysis, providing granular data at the ZCTA level instead of the county level. The results from this analysis should be disseminated to local health departments to inform the targeted expansion of services for PWID, including SSPs, HIV/HCV testing and treatment, naloxone distribution, and community outreach to prevent HCV and HIV infection among this high-incidence community.